1 Introduction


With the development of Breeding 4.0, new genotyping and phenotyping 
tools are needed to help the breeding process to increase the productivity of 
genotypes (Van Eeuwijk et al., 2019, Wallace et al., 2018). This includes a trend 
to integrate multiple layers of genomics, high-throughput plant phenotyping 
(HTPP), and large-scale envirotyping to improve the prediction of complex traits 
(Crossa et al., 2021, Cooper et al., 2014). Whole genome-enabled prediction, 
referred to as genomic prediction (GP) or genomic selection (GS), is the main 
approach to integrate these new tools into breeding programs for supporting 
the delivery of high- and sustainable-yielding cultivars. The main goal of GS is 
to predict complex traits based on marker information, increasing the precision 
of selection by generating a genomic estimated breeding value for candidates. 
Therefore, GS is potentially superior to phenotypic selection for increasing 
genetic gains per unit time and shortening the length of the breeding cycle 
(Crossa et al., 2017). Recently, the requirements of breeders are increasingly 
shifting towards accommodating HTPP data and environmental information 
into the multi-environment trial analysis (Araus et al., 2018). However, it is 
uncertain whether the increasing number of phenotypes measured and the 
depth of information from the new phenotyping platforms will improve the 
prediction of quantitative traits. 

Improvements in yield have been achieved mainly by evaluating grain yield 
as the sole selection criterion. However, grain yield in cereals is an integrative 
trait resulting from the genetic ability of a plant to grow, to capture resources, 
and to transfer them into the grain in a specific environment. Determination of 
grain yield involves multiple processes, and it is therefore under multi-gene 
control with complex interactions with the environment. Reynolds and Langridge 
(2016) suggested that to accelerate yield improvement, physiological traits 
need to be considered for breeding in addition to the advanced genomics 
methods (Reynolds et al., 2012). Furthermore, since a comprehensive genetic 
basis explaining cultivar-level differences in performance does not yet exist for 
any crop, physiological breeding currently relies heavily on phenomics. 

Digital phenotyping tools enable rapid, non-invasive, and non-destructive 
screening of populations in field conditions. These tools make use of remote 
sensing and close-range technologies (from airplane to handheld device) to 
screen experimental trials with hundreds or thousands of plots (Araus et al., 
2018). 
Digital phenotyping enables repeat measurements over time and thus provides 
valuable dynamic information on crop growth as well as crop physiology 
and phenology, to enable an understanding of the whole life-cycle growth 
performance of the genetic materials in relation to the end product, that is grain 
yield. In addition, conducting experiments at multiple locations should help 
to deliver optimal final products targeting specific environments with optimal 
yield and performance for farmers. However, while a wealth of information is 
now available from digital phenotyping, not everything is necessarily needed 
or even relevant for the breeding community. One important aspect to 
consider is the cost and the throughput of the technologies being deployed for 
breeding trials. Reynolds et al. (2012) emphasized that breeders are interested 
in reducing the size of their test populations to achieve the required genotype 
at a reasonable probability and cost, even if the methods might be seen as 
imperfect from a physiological and theoretical point of view. 

Digital phenotyping involving the application of remote sensing 
technologies and methods at spatially close proximal scales enables access 
to the different components of plant growth with unprecedented accuracy 
(in terms of temporal, spatial, and spectral resolution). Quantifying the 
morphological, phenological, and physiological traits at plot scale is now 
possible, even if some approaches still require refinement to replace traditional 
approaches. The use of optical sensors offers the possibility to assess biomass, 
nitrogen, and chlorophyll estimation along the plant development cycle, 
as well as physiological parameters linked to photosynthesis and light use 
efficiency (LUE) of the crop. Thermal infrared technology measuring leaf and 
canopy temperature provides information on plant water transpiration status 
and water use. Lidar technology gives access to plant stature and morphology. 
Visible imaging, using simple camera technology, offers the greatest versatility, 
providing high pixel resolution and recording information akin to the human 
eye. This is extremely valuable as breeders traditionally rely heavily on 
visual observation, the 'breeder's eye'. 

High-throughput plant phenotyping records numerous traits with high 
spatiotemporal resolution at a level far beyond what humans can assimilate. As 
a result, a new method of selecting phenotypes has recently been proposed, 
known as 'phenomic selection' (Rincent et al., 2018). In this approach, data 
such as hyperspectral information are used the same way as molecular markers 
to make inferences about relatedness. In other words, phenotypic variables 
(referred to as secondary traits) replace genetic information in a traditional GP 
approach. Integration of both types of information (genetic plus phenomic) as 
predictors in genotype-to-phenotype (G2P) models has been used to improve 
the predictive ability for yield in wheat (Krause et al., 2019). To that end, the 
complexity of the new HTPP data raises many questions about how to efficiently 
integrate genetic and non-genetic components to improve the accuracy of 
prediction for complex traits. 

Applications of HTPP data for predictive G2P models may be grouped into 
three categories. The first strategy uses the phenotypic data generated from the 
platforms as the target phenotype (Lyra et al., 2020, Watanabe et al., 2017). The 
second strategy uses thousands of reflectance data points (hyperspectral bands 
or derived vegetation indices (VIs)), captured at many stages of crop development 
as secondary phenotypes (genotype-specific covariables), with G2P models to 
improve the prediction of primary (target) traits, that is grain yield (Rutkoski et al., 
2016, Sun et al., 2017). A third strategy is modelling plant growth from longitudinal 
traits (e.g., light interception, biomass accumulation, canopy height), allowing the 
selection of high-yielding cultivars at the early stages of development (Moreira 
et al., 2020). Several authors have applied different statistical models to handle 
this data complexity, including multivariate analysis (Rutkoski et al., 2016, Sun 
et al., 2017), factorial regression (Van Eeuwijk et al., 2019), functional regression 
(Montesinos-Lopez et al., 2017b), multi-kernel regression (Krause et al., 2019), 
deep learning (Cuevas et al., 2019), regularized selection indices (Lopez-Cruz 
et al., 2020), mega-scale linear mixed model (MegaLMM) (Runcie et al., 2021), 
and crop growth models (Van Eeuwijk et al., 2019). Furthermore, with the recent 
developments of precision envirotyping platforms, the integration of new 
phenotypic and envirotypic information will become a new standard. 

Envirotyping is the process of collecting environmental factors (e.g. soil and 
climate information) in multi-environment trials (MET), intending to characterize 
the variation of the phenotypic performance of genotypes over, for example, 
environmental gradients (envirotype) (Cooper et al., 2014, Xu, 2016). Another 
promising concept that could be applied in crop breeding for a more optimized 
variety selection is 'enviromics', which is the integration of envirotypes with 
multiple enviromic markers which correspond to environmental variables that 
may interact with the genetic background (Resende et al., 2021). Envirotyping 
information has been used to explore envirotype-to-phenotype dynamics and 
has been incorporated in G2P models to improve the prediction of complex 
traits (Costa-Neto et al., 2020, 2021a, b, Millet et al., 2019, Porker et al., 2020). 

In this chapter, recent literature on incorporating new phenotyping data 
into predictive G2P models is explored, along with an introduction to the use 
of envirotyping information. The chapter is presented in two sections. First, the 
main ‘traits’ that are currently used and/or should be useful for breeding are 
introduced. Second, the most common statistical methods used to integrate 
markers, environmental information, and HTPP data are described. 


2 Digital phenotyping as a tool to support breeding programs 


Yield is considered a primary trait and is the goal in most breeding programs. 
For decades, breeding has been focused only on selecting genetic material 
based on the yield itself at harvest. However, as mentioned in the study by Van 
Eeuwijk et al. (2019), secondary phenotypes have proved useful as a covariate in 
prediction models for yield. These secondary phenotypes are defined as basic 
or intermediate traits (Bustos-Korts et al., 2019). The basic traits correspond to 
response mechanism/sensitivities to environmental conditions (e.g. sensitivity 
to photoperiod, water uptake capacity, radiation use efficiency), whilst the 
intermediate traits result from the integration of several processes over time 
(e.g. biomass accumulation, flowering time, grain number). 

The introduction of secondary traits in prediction models is called 
physiological breeding (Reynolds and Langridge, 2016). However, the application 
may be difficult to implement depending on the secondary phenotypes 
considered. For example, measuring height or scoring for heading or anthesis 
time is straightforward (despite the fact it is time-consuming), while other 
phenotypes such as biomass or photosynthesis are more constraining in terms 
of requiring destructive measurement or demands to collect data at a large scale 
in a short amount of time. Pask et al. (2012) summarized most of the secondary 
phenotypes used for breeding as well as the practical methods to collect them. 
The authors also brought to light some of the remote sensing methods applied 
at the plot scale, for example using spectroradiometers and thermal infrared 
sensors, as non-destructive and non-invasive technologies to produce proxies 
of biomass and transpiration traits, respectively. Although a vast amount of 
data are now available from the multiple technologies for phenotyping, such as 
visual, multi-/hyperspectral imaging, thermal infrared, and fluorescence sensors, 
as well as lidar technologies, not all this information will be relevant for breeders. 
Digital phenotyping must balance the time of acquisition with the relevance of 
the proxies and the cost to deploy the technologies. 

In this section, we will review the secondary phenotypes measured by 
high-throughput digital phenotyping tools that are potentially useful for 
breeding. The secondary phenotypes will be considered in terms of the level 
of integration and complexity regarding their use as single/multi-indirect traits 
for yield prediction, covariates for yield prediction, or for integrating over time 
using growth curves to access new parameters summarizing growth events. 


2.1 Simple secondary traits 


This section summarizes secondary phenotypes considered as simple variables. 
In this case, 'simple' means that the variable is the output from a sensor and is 
not the result of the combination of different metrics or sensors. The secondary 
phenotypes are sub-divided according to the morphological, physiological, 
and phenological classification of the traits (Violle et al., 2007). 


2.1.1 Morphological traits 


Morphological traits such as height and canopy cover are the most 
straightforward variables to collect using lidar technologies and/or 
visible cameras (Red-Green-Blue (RGB)). Height can be easily calculated not 
only from different lidar systems (Deery et al., 2014, Friedli et al., 2016, Virlet 
et al., 2016) but also from the structure from motion (SfM) principle when RGB 
cameras are coupled with an appropriate vector, such as unmanned aerial 
vehicles (UAV) or tractor-based systems (Holman et al., 2016, Jay et al., 2015). 
The output of these two approaches is called a point cloud, with each point 
having x, y, and z coordinates. Height is the easiest parameter derived from the 
point cloud. Recent studies have shown capabilities to estimate the volume of 
the canopy, the above-ground biomass (AGB), the leaf area index (LAI), or the 
stem diameter by processing the point clouds using voxelization (Che et al., 
2020, Hosoi et al., 2013, Jimenez-Berni et al., 2018, Salas Fernandez et al., 2017, 
Xiao et al., 2020). 
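
As a minimal sketch of how plot height can be derived from a point cloud, the 
Python example below takes the difference between a high and a low percentile 
of the z coordinates. The function names and the percentile defaults (1st and 
99th) are illustrative assumptions, not the procedure of any of the studies 
cited above; operational pipelines typically also filter noise and model the 
ground surface explicitly. 

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1.0 - frac) + ordered[upper] * frac


def plot_height(points, ground_q=1.0, canopy_q=99.0):
    """Estimate plot height from (x, y, z) points.

    Height is taken as the gap between a high percentile of z (canopy top)
    and a low percentile of z (ground). Both percentiles are assumptions
    chosen to reduce sensitivity to outlier points.
    """
    z = [p[2] for p in points]
    if not z:
        raise ValueError("point cloud is empty")
    return percentile(z, canopy_q) - percentile(z, ground_q)
```

Using percentiles rather than the raw minimum and maximum is a design choice 
that trades a small bias for robustness to isolated spurious returns. 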

With sufficient spatial resolution, processing the point cloud enables access 
to other morphological parameters such as the leaf area and angle (Mantilla- 
Perez and Salas Fernandez, 2017), as well as the volume of the reproductive 
organs in cereal crops. It is also possible to extract the fraction vegetation cover 
(FVC) from the point clouds as in the study by Duan et al. (2016). However, 
FVC is generally computed from RGB images as the ratio of green crop pixels 
to the total number of pixels contained in the region of interest. Numerous 
approaches have been developed to segment the green plant pixels from the 
background, as ambient illumination is a limiting factor for the efficiency of 
the segmentation (Casadesús et al., 2007, Guo et al., 2013, Hamuda et al., 2016, 
Sadeghi-Tehran et al., 2017a). Those methods range from simple threshold 
approaches to deep learning methods overcoming the illumination issues by 
training the algorithm with a wide range of illumination conditions. There are 
multiple applications of monitoring FVC such as enabling quantification of 
canopy development from emergence to maturity (Borra-Serrano et al., 2020, 
Sadeghi-Tehran et al., 2017a, Varela et al., 2021). 
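
The pixel-ratio definition of FVC can be illustrated with a short Python 
sketch. It assumes the simplest kind of threshold approach: an Excess Green 
value (2g - r - b on normalized channels) compared against a fixed threshold. 
The threshold of 0.1 and the function names are hypothetical choices for 
illustration and are not taken from the cited studies, which use more robust 
segmentation methods. 

```python
def excess_green(r, g, b):
    """Excess Green index computed on normalized (chromatic) channels."""
    total = r + g + b
    if total == 0:
        return 0.0  # a black pixel carries no colour information
    return (2 * g - r - b) / total


def fraction_vegetation_cover(pixels, threshold=0.1):
    """Ratio of green pixels to all pixels in the region of interest.

    pixels: iterable of (R, G, B) tuples.
    threshold: assumed Excess Green cut-off separating plant from background.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("region of interest contains no pixels")
    green = sum(1 for r, g, b in pixels if excess_green(r, g, b) > threshold)
    return green / len(pixels)
```

A fixed threshold of this kind is exactly what becomes unreliable under 
changing illumination, which is the motivation for the learning-based methods 
mentioned above. 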

In recent years, there has been a growing interest in using RGB images 
to quantify the number of ears/panicles for the major grain crops (Duan et al., 
2015, Fernandez-Gallego et al., 2018, Lu et al., 2017, Xiong et al., 2017, Chandra 
et al., 2020, Ghosal et al., 2019, Liu et al., 2020, Madec et al., 2019, Sadeghi- 
Tehran et al., 2019, Velumani et al., 2020, Zhou et al., 2019). Many of these 
approaches rely on machine learning methods and require the acquisition 
of solid training data sets to encompass the various stages of maturation of 
the reproductive organs and the genetic variation. To facilitate such data set 
acquisition, a group of researchers from nine different institutions across seven 
countries and three continents developed the Global Wheat Head Detection 
data set that can be used to benchmark methods proposed in the computer 
vision community (David et al., 2020). 


2.1.2 Physiological traits 


Digital phenotyping also enables the measurement of physiological traits 
related to carbon and nitrogen metabolism. Physiological traits are mainly 
assessed using the spectral properties of the plant, whether in the optical 
domain or in the thermal infrared. Multi- and hyperspectral technologies 
(single-point or imaging systems) are used to derive spectral 
reflectance indices (SRIs), which are mainly related to biomass accumulation 
and component traits (light interception and light/radiation use efficiency). 
SRIs are obtained by the arithmetic combination of two or more wavelengths, 
with each index designed to accentuate a particular vegetation property 
(https://www.l3harrisgeospatial.com/docs/vegetationindices.html). 

More than 100 indices have been developed over the past 50 years. Most 
of the indices developed are related to biomass accumulation and light 
interception, such as the normalized difference vegetation index (NDVI) and 
the derivatives developed to overcome NDVI limitations (Corti et al., 2018, 
Cabrera-Bosquet et al., 2011, Gutierrez et al., 2012, Haboudane et al., 2004, 
Yue et al., 2017). Chlorophyll and nitrogen leaf/canopy content as well as 
other pigments have been extensively investigated over recent decades using 
vegetation reflectance properties (Berger et al., 2020, Blackburn, 2007, Boegh 
et al., 2002, Camino et al., 2018a, Cammarano et al., 2014, Fitzgerald et al., 
2010, Gitelson and Solovchenko, 2018, Jay et al., 2015). Post-harvest traits 
related to nitrogen have been also explored extensively in the literature (Erdle 
et al., 2013, Frels et al., 2018, Pavuluri et al., 2015, Prey et al., 2020, 2019, Prey 
and Schmidhalter, 2020, 2019). The above-mentioned agronomical traits reflect 
the process of biomass accumulation, which can be monitored and assessed 
using different VIs such as NDVI and NDVI-like indices. However, those indices 
do not report short-term variation that might occur with changes in ambient 
conditions (Dobrowski et al., 2005). Among the SRIs, the photochemical/ 
physiological reflectance index (PRI), a normalized reflectance index that uses 
the 531 nm and 570 nm wavelengths, is generally used as a direct method 
to assess LUE over short periods (Gamon et al., 2016). PRI has been shown 
to negatively correlate with non-photochemical quenching (NPQ) and the 
de-epoxidation state of the xanthophyll cycle (Evain et al., 2004, Peguero- 
Pina et al., 2008, Porcar-Castell et al., 2012, Rascher et al., 2007) and positively 
correlate with steady-state fluorescence, F', and the photosystem II (PSII) 
operating efficiency (Fq'/Fm') under differing irrigation regimes in controlled or 
natural environmental conditions (Evain et al., 2004, Peguero-Pina et al., 2008). 
The use of active sensors such as the laser-induced fluorescence transient 
should be useful to assess photosynthetic dynamics of the crop and replace 
gas-exchange measurements. However, this technology still needs refinement 
to be applied practically on a large scale (Wyber et al., 2017, 2018). 
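
Both NDVI and PRI are normalized differences of two reflectance values, which 
the following Python sketch makes explicit. The band ordering used for PRI 
(531 nm minus 570 nm in the numerator) is one common convention and should be 
treated as an assumption, since some authors use the opposite sign; the 
function names are illustrative. 

```python
def normalized_difference(a, b):
    """Generic normalized difference (a - b) / (a + b) of two reflectances."""
    denom = a + b
    if denom == 0:
        raise ValueError("reflectances sum to zero")
    return (a - b) / denom


def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def pri(r531, r570):
    """Photochemical reflectance index from reflectance at 531 and 570 nm.

    The sign convention (531 first) is an assumption; check the convention
    of the study being reproduced.
    """
    return normalized_difference(r531, r570)
```

For example, `ndvi(0.5, 0.1)` evaluates to about 0.67, a value typical of a 
green canopy, whereas `pri(0.10, 0.12)` is slightly negative. 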

The canopy water status may be investigated using SRIs of specific 
wavelengths sensitive to water content. The most used of these indices are 
the water band index and normalized water band indices computed from the 
wavelengths from the 850-970 nm range (Prasad et al., 2007). However, as 
shown in Winterhalter et al. (2011), other SRIs have been investigated using 
various wavelengths in the visible-near infrared domain and wavelengths of the 
shortwave infrared domain. In addition to the canopy water content, canopy 
transpiration may be assessed using thermal infrared technologies (Jones 
et al., 2009), which measure the canopy temperature and its variation due to 
the control of stomatal opening/closure regulating water loss in response to 
changes in environmental conditions (Gonzalez-Dugo et al., 2014, Kelly et al., 
2019, Maes and Steppe, 2012, Munns et al., 2010). 


2.1.3 Phenological traits 


Crop phenology is usually scored by using Zadoks, Feekes-Large, or the Haun 
systems. The remote sensing community has been using VIs to estimate the 
stage of development of various crops using variations in the VIs during growth 
(Duncan et al., 2015, Piao et al., 2019, Yamasaki et al., 2017). Recent advances 
in the phenotyping community allowing collection of images at a closer range 
to the crops have enabled an unprecedented spatial resolution. The use of 
RGB cameras, coupled with appropriate algorithms, has enabled detection and 
quantification of some of the important growth stages such as heading and 
flowering stages in cereal crops. 

The dynamics of ear counts per image may be used to estimate heading dates 
(Velumani et al., 2020, Wang et al., 2019). Flowering information has been 
collected in field conditions for various annual and perennial crops such 
as Lesquerella, apple tree, grapevine, rice, wheat, and cotton (Aquino et al., 
2015, Guo et al., 2015, Hocevar et al., 2014, Sadeghi-Tehran et al., 2017b, 
Thorp and Dierig, 2011, Xu et al., 2018). Different approaches have been 
used for the flower segmentation using threshold-based algorithms (Hocevar 
et al., 2014, Thorp and Dierig, 2011), for segmentation and quantification using 
mathematical morphology and pyramidal algorithm (Aquino et al., 2015), for 
identification of flowering within RGB images using k-means clustering and 
support vector machine (SVM) (Guo et al., 2015, Sadeghi-Tehran et al., 2017b), 
and for identification of flowers in three-dimensional (3D) images from aerial 
pictures using convolutional neural networks (Xu et al., 2018). 

The senescence period has also been investigated using RGB, multi-, and 
hyperspectral sensors. Temporal senescence data make it possible to derive 
parameters such as the onset, midpoint, and duration of senescence, and the 
topic will be detailed in Section 2.3 (Anderegg et al., 2020, Borra-Serrano et al., 
2020, Christopher et al., 2014, Kipp et al., 2014, Montazeaud et al., 2016). 


2.2 Multi-sensor secondary traits modelling 


To increase the efficiency of agronomic trait prediction (e.g. biomass and 
physiological traits), many approaches may be used, such as (i) arithmetic 
combination of secondary traits to try to mimic as much as possible the 
agronomic trait, (ii) multi-linear/non-linear regression model to select the 
secondary phenotypes enabling better prediction, and (iii) multivariate analysis 
which summarizes secondary phenotypes information to predict the agronomic 
traits of interest. In Section 2.2, only statistical methods used to predict 
traits without including genomic information are described. 


2.2.1 Arithmetic combination of secondary phenotypes 


AGB has been predicted using lidar outputs such as height and volume as well 
as using spectral VIs and FVC. The accuracy of the predictions varies depending 
on the crop, the type of trial, treatment, and stage of development. The use of 
SRIs as a biomass predictor, as well as the FVC derived from the RGB camera, 
is often limited by saturation under dense cover, and height on its own might 
not be sufficient as a biomass indicator. It has been shown that the 
combination of SRI or FVC (or 
both) with height improves biomass prediction. Yue et al. (2017) showed that 
the incorporation of the height data by multiplication or division of the eight VIs 
tested improved the biomass prediction in all cases. Maimaitijiang and Sidike 
(2019) developed a canopy volume index based on the parameters extracted 
from the crop height model from the orthomosaic image obtained by UAV 
and combined it with different SRIs, which improved the biomass prediction. 
Another approach was taken by Jin et al. (2019) to predict biomass by 
estimating the volume of the plot. The authors combined height at anthesis time, 
when it is considered to be at its maximum, the post-harvest stem diameter, and 
the stem number extracted from RGB images of the residual stems left standing 
after harvest. Lu et al. (2021) also used a combination of VI and height 
measurement to estimate the canopy nitrogen content (CNC). They showed 
that the combination of height, VI, and FVC increased the CNC prediction. 
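
The arithmetic combination of a VI with height described above can be 
sketched in a few lines of Python. The function name and the `mode` parameter 
are hypothetical; the sketch only illustrates the idea of multiplying or 
dividing an index by canopy height, and does not reproduce the specific 
indices of the cited studies. 

```python
def height_combined_index(vi, height_m, mode="multiply"):
    """Combine a vegetation index with canopy height (in metres).

    mode="multiply" returns vi * height, mode="divide" returns vi / height.
    Both forms are illustrative of the arithmetic combinations discussed
    in the text.
    """
    if mode == "multiply":
        return vi * height_m
    if mode == "divide":
        if height_m == 0:
            raise ValueError("height must be non-zero for division")
        return vi / height_m
    raise ValueError("mode must be 'multiply' or 'divide'")
```

Two plots with the same saturated index value, say 0.9, but heights of 0.6 m 
and 0.9 m receive different combined values under multiplication, which is 
how the height term can restore sensitivity under dense cover. 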

SRIs have been used extensively to determine the canopy nitrogen (N) status 
in various crops. However, as mentioned by Fitzgerald et al. (2010), measuring 
the N status by SRI was found to be challenging across the season, as the ground 
coverage and the canopy structure both change with crop growth, and there is 
also an N dilution effect. The canopy chlorophyll content index 
(CCCI) has been developed to differentiate N from water stress in irrigated 
cotton and separate the confounding effects of canopy density (Barnes et al., 
2000, Clarke et al., 2001, Rodriguez et al., 2006). Fitzgerald et al. (2010) have 
developed a proxy of canopy nutrition index (CNI, also called nutrition nitrogen 
index), allowing the monitoring of crop N status at different growth stages 
without being affected by the N dilution effect. The CNI is built on the relationship 
between dry biomass and the % N of the vegetation. Based on the same 
principle, the CCCI, a planar domain VI measuring plant biophysical parameters 
in a mixed soil/plant pixel, can be built by analysing the relationship between 
one chlorophyll and one biomass-related SRIs plotted in a two-dimensional (2D) 
space (Clarke et al., 2001, Pancorbo et al., 2021). Another index, the water deficit 
index (WDI), was developed in the 1990s to overcome the confounding effects of N 
and water stress on the SRI. The WDI is also a planar domain index: it is 
based on the relationship between a VI such as the NDVI and the 
surface-minus-air temperature difference (Ts - Ta) obtained from a thermal 
infrared sensor. Building this 2D space makes it possible to estimate crop 
evapotranspiration with less influence from canopy density 
(Moran et al., 1994, Pancorbo et al., 2021, Virlet et al., 2014). 


2.2.2 Multi-linear/non-linear regression 


In Section 2.2.1, most studies generally focused on single regression analysis 
based on a simple or a complex index to estimate the trait of interest. However, 
the single regression approaches might not be sufficient to estimate integrative 
traits such as biomass and grain yield in cereals. Here, different approaches to 
integrate parameters using multiple regression analysis (linear or non-linear) 
are reviewed. The first part covers research works predicting traits based on 
a single date of data collection using multiple traits, and the second section 
covers studies with data collection on multiple days, irrespective of whether the 
date was used as an independent variable in the models. 

Multi-linear regression has been applied to estimate grain yield on maize, 
barley, and soft red wheat (Gracia-Romero et al., 2017, Kefauver et al., 2017, 
Pavuluri et al., 2015). On maize and barley, the best model for improving grain 
yield prediction from UAV imaging was a combination of two to five indices 
derived from RGB and/or multispectral imaging (Gracia-Romero et al., 2017, 
Kefauver et al., 2017). At the ground level, Pavuluri et al. (2015) showed that 
models with three SRIs gave the best predictions for grain yield. On sorghum, 
Li et al. (2018) used simple and multiple exponential regression analyses to 
estimate fresh and dry biomass. In this case, a simple exponential regression 
model based on canopy height gave the best biomass prediction in most cases, 
with some improvement when VIs were also included. Vargas et al. 
(2019) used the Lasso algorithm and found that the best prediction for AGB 
in peas was a combination of VIs, FVC, and canopy volume. In the study by 
Santini et al. (2019), a combination of VIs and thermal infrared data gave the 
best prediction for the stem volume of adult trees of Pinus halepensis. 
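
The two model families discussed in this paragraph can be sketched in plain 
Python. The example fits a multiple linear regression by ordinary least 
squares (normal equations solved by Gaussian elimination) and an exponential 
model y = a * exp(b * x) by regressing log(y) on x. Fitting the exponential 
model on the log scale is an assumption made for simplicity and is not 
necessarily how the cited authors fitted their models; all function names are 
illustrative. 

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_multiple_linear(rows, y):
    """Ordinary least squares; returns [intercept, b1, b2, ...].

    rows: one list of predictor values (e.g. several indices) per plot.
    """
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_linear(coefs, row):
    """Predicted value for one plot from fitted coefficients."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def fit_exponential(x, y):
    """Fit y = a * exp(b * x) via a log-linear regression; returns (a, b).

    Requires strictly positive y. Fitting on the log scale is an assumption
    of this sketch.
    """
    coefs = fit_multiple_linear([[v] for v in x], [math.log(v) for v in y])
    return math.exp(coefs[0]), coefs[1]
```

In practice, the number of predictors retained would be chosen by 
cross-validation or a penalized method such as the Lasso, which this sketch 
does not implement. 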

Multiple regression has also been applied to nutrient content prediction. 
Gracia-Romero et al. (2017) used this approach to predict phosphorus (P) 
content in maize and improve P content prediction compared to simple 
regression analysis. However, the model prediction was still low. Pavuluri et al. 
(2015) had more success estimating grain N uptake, yield, and protein nitrogen 
use efficiency using models built from data collected at different growth stages 
(stem elongation, booting, heading, and grain filling). The predictions were 
better by combining the 2 years of measurement for each growth stage. 

Bendig et al. (2015) looked at biomass monitoring prediction in barley 
based on a combination of plant height and VIs using data collected before and 
after the heading date. Models based on single and multiple exponential or linear 
regressions were tested using either the full data set or only the pre-heading 
data set. Cross-validation showed that the combination of VIs with plant height 
performed better than using the VIs alone, regardless of whether multiple linear 
or non-linear regression was used. In a different approach, Hoyos-Villegas et al. (2014) 
combined canopy height, FVC, AGB, photosynthesis, and rate of growth 
collected at 13 time points into multiple regression models to predict grain 
yield, AGB, photosynthesis rate, and rate of growth. The approach taken by 
the authors provided interesting model predictions for each of the parameters 
and showed that multi-sensor data produce highly successful predictions. The 
main drawback was the feasibility of upscaling to large populations for breeding 
purposes due to the amount of data to collect, especially for the gas exchange. 

Magney et al. (2016) used NDVI data and the parameters from the growth 
and senescence curves (growth rate and duration of NDVI for tillering, stem 
elongation, heading, and maturation) to predict grain yield, total biomass, 
protein content, and grain N yield using multi-linear regression models. The 
best results were obtained by the models using two or three phenological 
stages. This study also showed that using NDVI-derived phenological metrics 
from the early season (tillering and stem extension) substantially improved 
early prediction of yield and biomass, as compared to daily NDVI data, 
whereas protein and grain N were primarily driven by metrics associated 
with the reproductive development of the crop (heading and ripening). More 
recently, Buchaillot et al. (2019) used multiple regression analysis to predict 
grain yield in maize based on agronomic data coupled with (i) field sensors, 
(ii) ground imagery, and (iii) UAV imagery. Most of the models improved the 
grain yield prediction using four to nine parameters. This study also showed 
that the simple use of the date of anthesis, the duration from anthesis to the 
silking stage, and the chlorophyll index given by the SPAD-502 at the vegetative 
and reproductive stages gave the best prediction of grain yield 
(r² = 0.61). The model using UAV image data showed promising results (r² = 0.60) 
but used nine parameters instead of the four used in the previously mentioned 
model. 

Camino et al. (2018b) used multiple regression analysis to assess the 
contribution of sun-induced fluorescence (SIF) indices for nitrogen prediction 
in wheat. The performance of the built models with and without SIF was 
compared with the performance of models built with plant traits estimated by 
the PROSPECT-SAILH model inversion and with standard approaches based 
on single SRIs. The use of SIF indices increased the performance of the models 
based on the output of the inversion models (r² = 0.68-0.77 without SIF, r² = 
0.92-0.93 with SIF) and also outperformed the prediction for the chlorophyll 
content by the narrow-band indices (r² = 0.25-0.57). 


2.2.3 Partial least square regression 


Partial least square regression (PLSR), a multivariate analysis, has proved 
useful for dealing with large numbers of parameters and volumes of data. The 
use of PLSR in plant phenotyping is increasing and shows promising results 
to estimate and predict different plant parameters. During the last decade, it 
has been used to estimate grain yield in barley, maize, and wheat (Barmeier 
and Schmidhalter, 2017, Elsayed et al., 2018, 2017, 2015, Garriga et al., 2017, 
Rischbeck et al., 2016, Weber et al., 2012), biomass at tillering and anthesis 
in barley, maize, and wheat (El-Hendawy et al., 2019a, Elsayed et al., 2018, 
Montes et al., 2011), leaf nitrogen and chlorophyll content (Ecarnot et al., 2013, 
Silva-Perez et al., 2018, Vigneau et al., 2011), canopy water status in barley and 
wheat (El-Hendawy et al., 2019a, b, Elsayed et al., 2015, 2017), and physiological 
and photosynthetic parameters in cotton, blueberry, and tobacco (Fu et al., 2019, 
2020, Lobos et al., 2019, Meacham-Hensold et al., 2020, Silva-Perez et al., 2018, 
Thorp et al., 2015). Most of these studies use reflectance data as input 
for the PLSR. However, some authors have used PLSR with SRIs directly. 
Elsayed et al. (2015) used SRIs from different sensors to estimate grain yield, 
and the PLSR model improved grain yield prediction compared to the single 
regression model. El-Hendawy et al. (2019a) and El-Hendawy et al. (2019b) 
built PLSR models based either on SRIs or on a selection of the most influential 
wavelengths. While in the El-Hendawy et al. (2019a) study both approaches 
gave similar results, in the El-Hendawy et al. (2019b) study the PLSR model 
based on SRIs gave higher values of r² and lower values of the root mean 
squared error (RMSE) between observed and predicted values of the measured 
parameters. Similarly, Fu et al. (2020) compared the performance of 
PLSR models based on reflectance data and on SRIs for estimating two 
photosynthetic parameters (Vcmax and Jmax). For Vcmax, the PLSR model based on 
SRIs gave better predictions than the model based on the wavelengths, while for 
Jmax both approaches gave similar predictions. 
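
To illustrate why PLSR copes with many collinear predictors, such as neighbouring reflectance bands or related SRIs, the following minimal Python sketch fits a PLS1 model with a single latent component. It is an illustration only: the function names (`pls1_fit`, `pls1_predict`) and the toy data are assumptions made here, and the studies cited above used multi-component models fitted with dedicated statistical software.

```python
import math

def pls1_fit(X, y):
    """Fit a one-component PLS1 model. X: list of rows, y: list of floats.

    Assumes y is not constant (otherwise the weight vector has zero length).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: cross-product of each centred predictor with centred y,
    # scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of every sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress centred y on the single latent score.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "q": q}

def pls1_predict(model, x):
    """Predict the trait for one new sample x (list of predictor values)."""
    score = sum((x[j] - model["x_mean"][j]) * model["w"][j] for j in range(len(x)))
    return model["y_mean"] + model["q"] * score

# Hypothetical toy data: the second band is exactly twice the first one
# (perfect collinearity), and the trait depends linearly on the first band.
bands = [[0.10, 0.20], [0.20, 0.40], [0.30, 0.60], [0.40, 0.80]]
trait = [3.0 * row[0] + 1.0 for row in bands]

model = pls1_fit(bands, trait)
prediction = pls1_predict(model, [0.50, 1.00])
```

Ordinary least squares has no unique solution for these two perfectly collinear bands, whereas the single latent component recovers the underlying linear relationship.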

PLSR has also been used to integrate variables from different sensors. 
Montes et al. (2011) used this method to predict early biomass in maize based 
on height and reflectance data, while Rischbeck et al. (2016) used the method 
to predict grain yield in barley. Elsayed et al. (2017) used PLSR with different 
parameters to predict grain yield in wheat. Three PLSR models were built: one 
based on selected SRIs, one data fusion model based on SRIs and thermal 
infrared data, and one based on SRIs, thermal infrared data, relative 
water content, and canopy water content. In these three studies, the use of 
data fusion for PLSR improved the prediction, showing the strong potential of 
PLSR in plant phenotyping to integrate data from different sensors and 
cameras. 

Some authors have compared the PLSR model to other approaches such 
as machine learning algorithms, multi-linear regression models, or even to the 
parameters predicted by inversion of the radiative transfer model PROSAIL. In 
the study by Lobos et al. (2019), the PLSR model performed better than the 
multi-linear regression model to estimate fluorescence parameters in blueberry 
based on reflectance data, while in the study by El-Hendawy et al. (2019b), the 
two approaches gave similar results for yield prediction. Garriga et al. (2017) 
showed that for most of the traits and conditions tested, the estimations 
provided by the ridge regression and modified SVM were the same or better 
than those provided by the SRIs. They also found that among the classification 
methods, the PLS-discriminant analysis showed the best performance, and 
unlike the SRI and regression models, most traits were relatively well classified 
within a specific hydric condition. In the study by Montes et al. (2011), the SVM 
regression model for estimating biomass at the early stage performed slightly 
better than the PLSR. Fu et al. (2019) developed a framework combining 
six machine learning algorithms, including artificial neural network (ANN), 
SVM, least absolute shrinkage and selection operator, random forest (RF), 
Gaussian process, and PLSR, to optimize high-throughput analysis of the two 
photosynthetic variables, Vcmax and Jmax, based on reflectance data. The PLSR 
model was the second-best model, with r² and RMSE close to the best model. 
Thorp et al. (2015) used PLSR to estimate four phenotypes: leaf water content 
(Cw), specific leaf mass (Cm), leaf chlorophyll a and b content (Cab), and LAI. 
The results showed that the PLSR model was the most robust for estimating the four 
investigated parameters compared to linear regression based on SRIs and to 
PROSAIL model inversion. 


2.3 Integrative secondary traits 


Different vectors (platforms) have been used to look at plant growth parameters 
(canopy height, FVC, and SRIs), including handheld devices or telescopic poles 
(Anderegg et al., 2020, Cairns et al., 2012, Christopher et al., 2014, Grieder 
et al., 2015, Montazeaud et al., 2016, Velumani et al., 2020), tractor-based 
systems (Comar et al., 2012, Kipp et al., 2014), ground platforms (Aasen et al., 
2020, Beauchêne et al., 2019, Kronenberg et al., 2017), and UAV systems 
(Borra-Serrano et al., 2020, Burkart et al., 2018, Han et al., 2018, Holman et al., 
2019, Holman et al., 2016, Varela et al., 2021). A number of these studies have 
collected time-series data to visually identify (i) the beginning of the stem 
elongation period using canopy height data (Holman et al., 2016), (ii) growth 
stages in barley based on the growth profile (Burkart et al., 2018), (iii) rapid 
growth expansion and senescence dates and durations in soybean (Aasen et al., 
2020), and (iv) growth patterns among maize breeding material (Han et al., 2018). 

Fitted curves maximize the use of time series data, quantifying and 
summarizing the different curve phases corresponding to the stage of the 
crop development: slow and rapid canopy development, a plateau, and the 
senescence phase. Fitted curve parameters enable the comparison of the 
growth and senescence pattern of multiple genetic materials for genetic 
studies. 

Grieder et al. (2015) used spline functions to smooth the FVC data 
and compute daily relative growth rate (RGR) to look at the influence of the 
temperature on early growth data for 29 random lines from wheat germplasm. 
Varela et al. (2021) also used spline fitting curves on FVC, height, and SRI data. 
The authors aimed to predict AGB in sorghum with an RF algorithm, using the 
daily data and the corresponding RGR derived from the spline fits, and to 
understand the relationships between growth dynamics, temporal resolution, 
and end-of-season AGB in 869 diverse accessions of a highly productive 
photoperiod-sensitive sorghum. 
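
As a minimal sketch of how a relative growth rate can be derived from a (smoothed) time series, the Python example below uses the classical log-difference definition, RGR = (ln x2 - ln x1) / (t2 - t1). This definition, the function name `relative_growth_rate`, and the FVC values are assumptions for illustration; the cited studies computed RGR from spline-smoothed curves and may have used a different formulation.

```python
import math

def relative_growth_rate(days, values):
    """Return (mid_day, rgr) pairs for each pair of consecutive observations.

    RGR is computed as the difference in log values divided by the time step,
    so all values must be strictly positive.
    """
    observations = list(zip(days, values))
    rates = []
    for (d0, v0), (d1, v1) in zip(observations, observations[1:]):
        rgr = (math.log(v1) - math.log(v0)) / (d1 - d0)
        rates.append(((d0 + d1) / 2, rgr))
    return rates

# Hypothetical smoothed FVC values (fraction of ground covered) for one plot.
days = [10, 12, 14, 16]
fvc = [0.05, 0.10, 0.20, 0.30]

rates = relative_growth_rate(days, fvc)
```

In this toy series the cover doubles over each of the first two intervals, so the RGR is constant there and then drops as growth slows.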

Cairns et al. (2012) used NDVI time series data collected over a growing 
season of a large set of tropical and subtropical maize inbred lines and single 
cross hybrids to look at their grain performance. NDVI data were used to 
compute the area under the curve (AUC) as an integrative indicator of growth 
and senescence patterns. The AUC was divided into sections corresponding 
to the different stages of development to investigate the link with grain yield. 
Comar et al. (2012) used a bi-linear model to investigate SRI time series 
and derive integrative parameters for comparing six wheat varieties grown 
with different nitrogen levels and planted at different densities. More recently, 
Borra-Serrano et al. (2020) used a sigmoidal model (Gompertz) and Beta 
functions to fit FVC and canopy height data. The authors derived parameters 
including maximum absolute growth rate, early vigour, maximum height, and 
senescence for a collection of soybean genotypes and integrated them into a 
multi-linear regression model to estimate seed yield. 
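
The area-under-the-curve summary of an NDVI time series, including its division into sections by developmental stage, can be sketched with the trapezoidal rule. The example below is illustrative only: the function name `area_under_curve`, the NDVI values, and the use of day 70 as the boundary between a pre- and a post-flowering section are assumptions, and section boundaries are assumed to coincide with observation days.

```python
def area_under_curve(days, values, start=None, end=None):
    """Trapezoidal area under a time series.

    If start and/or end are given, only observations with
    start <= day <= end are used; boundaries are assumed to coincide
    with observation days (no interpolation is performed).
    """
    points = [
        (d, v)
        for d, v in zip(days, values)
        if (start is None or d >= start) and (end is None or d <= end)
    ]
    return sum(
        (d1 - d0) * (v0 + v1) / 2
        for (d0, v0), (d1, v1) in zip(points, points[1:])
    )

# Hypothetical NDVI readings for one plot (days after sowing).
days = [30, 50, 70, 90, 110]
ndvi = [0.3, 0.6, 0.8, 0.7, 0.4]

auc_total = area_under_curve(days, ndvi)
auc_pre_flowering = area_under_curve(days, ndvi, end=70)
auc_post_flowering = area_under_curve(days, ndvi, start=70)
```

Each section-specific area can then be related to grain yield separately, for example to check whether the growth or the senescence part of the curve carries more information.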

Some studies have only focused on the senescence period. Christopher 
et al. (2014), Kipp et al. (2014), and Montazeaud et al. (2016) used NDVI and 
other spectral data to model the dynamics of senescence and identify the 
stay-green phenotype using logistic, PLSR, and linear models, respectively. From the 
fitted curves, parameters such as the onset and mid-point of senescence and 
the senescence rate have been used to characterize the stay-green phenotype. 
Anderegg et al. (2020) have investigated different SRIs and modelling 
approaches to fit the data collected during the senescence period. PLS and 
cubist regressions have been used on the SRIs, and the derived parameters 
were assessed by prediction of the grain protein content and grain yield. 

The last few years have seen increasing interest in quantifying the number 
of reproductive organs in cereal crops, as mentioned earlier. Wang et al. (2019) 
and Velumani et al. (2020) used series of daily RGB images to 
quantify the number of ears. The authors fitted a logistic curve to the ear counts 
and extracted the inflection point, corresponding to half of the total number 
of ears, to estimate the heading date. 
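
The idea of reading a heading date from the inflection point of a logistic curve can be sketched as follows. This is a simplified illustration, not the procedure of the cited studies: the three-parameter logistic form, the grid search (used here in place of a non-linear least-squares routine), the function names, and the synthetic ear counts are all assumptions.

```python
import math

def logistic(t, k, r, t0):
    """Three-parameter logistic curve: plateau k, rate r, inflection point t0."""
    return k / (1.0 + math.exp(-r * (t - t0)))

def fit_logistic(days, counts, rates, midpoints):
    """Grid search over (r, t0).

    For each candidate pair, the plateau k has a closed-form least-squares
    solution because the curve is linear in k.
    """
    best = None
    for r in rates:
        for t0 in midpoints:
            f = [logistic(d, 1.0, r, t0) for d in days]
            k = sum(fi * c for fi, c in zip(f, counts)) / sum(fi * fi for fi in f)
            sse = sum((c - k * fi) ** 2 for fi, c in zip(f, counts))
            if best is None or sse < best[0]:
                best = (sse, k, r, t0)
    return {"sse": best[0], "k": best[1], "r": best[2], "t0": best[3]}

# Synthetic, noise-free ear counts per plot (day of year on the x-axis).
days = list(range(140, 161, 2))
ears = [logistic(d, 400.0, 0.5, 150.0) for d in days]

candidate_rates = [0.1 * i for i in range(1, 11)]        # 0.1 to 1.0
candidate_midpoints = [140 + 0.5 * i for i in range(41)]  # 140.0 to 160.0

fit = fit_logistic(days, ears, candidate_rates, candidate_midpoints)
heading_day = fit["t0"]  # day at which half of the final ear number is reached
```

With real counts the fit would not be exact, and the resolution of the estimated heading date is limited by the spacing of the candidate midpoints.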


3 Genotype-to-phenotype (G2P) models: integrating 
data from phenomics and envirotyping in predictive 
breeding 


3.1 Implementing complex G2P models in breeding programs 


It is well known that increasing productivity largely depends on genotype 
(G) and environment (E). With the newly developed technologies, the G 
component can now be accessed through multi-omics layers (Scossa et al., 
2021), the phenotype (P) component can be precisely measured using HTPP 
tools (Araus et al., 2018), and the E factors can be evaluated using envirotyping 
techniques (Xu, 2016). This section discusses how to integrate these new 
phenotyping data into G2P models to improve the prediction of complex traits. 
The utilization of HTPP data for genetic dissection (gene/trait discovery) of 
quantitative traits has been discussed previously (Moreira et al., 2020, Zhang 
et al., 2021, Mir et al., 2019, Brown et al., 2014, Li and Sillanpää, 2015). 

Published by Burleigh Dodds Science Publishing Limited, 2022. 

G2P models are considered here as different types of statistical models 
(e.g. linear, mixed, factorial regression, crop growth) that allow the identification 
of genetic and environmental factors driving phenotypic variation (Van Eeuwijk 
et al., 2019, Cooper et al., 2021). G2P models have a wide range of applications 
in breeding, mainly exploiting the genotype x environment interactions (GEI). 
Environmental effects and GEI are estimated through METs by evaluating 
genotypes in many locations and years. Traditionally, breeders have been 
using the final trait value for selection, which is the cumulative effect of many 
factors including genotype, environment, and interactions among them. 
Consequently, a simple linear model can be used to untangle the phenotypic 
variance, quantified as $V_P = V_G + V_E + V_{GE} + V_e$, where $V_G$ stands for 
genotypic variance, $V_E$ for environmental variance, $V_{GE}$ for GEI variance, 
and $V_e$ represents the random residual variance. Note that there are many 
strategies to handle GEI and the 
most common approaches are the mixed models, the additive main effects and 
multiplicative interaction models, the genotype plus genotype-by-environment 
models, the analysis of variances, and the stability measures (Van Eeuwijk et al., 
2016, Malosetti et al., 2013). 
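
Once variance components have been estimated from a MET, they can be turned into simple summaries. The sketch below computes the share of the phenotypic variance attributed to each source and an entry-mean heritability. The heritability formula is the standard quantitative-genetics expression for balanced trials and is not given in this chapter; the function names and the variance component values are hypothetical.

```python
def variance_shares(v_g, v_e, v_ge, v_res):
    """Proportion of the phenotypic variance attributed to each source."""
    total = v_g + v_e + v_ge + v_res
    return {
        "genotype": v_g / total,
        "environment": v_e / total,
        "gei": v_ge / total,
        "residual": v_res / total,
    }

def entry_mean_heritability(v_g, v_ge, v_res, n_env, n_rep):
    """Entry-mean heritability for a balanced MET (assumed standard formula).

    The GEI variance is divided by the number of environments and the
    residual variance by the number of environments times replicates.
    """
    return v_g / (v_g + v_ge / n_env + v_res / (n_env * n_rep))

# Hypothetical variance components for grain yield.
shares = variance_shares(v_g=2.0, v_e=5.0, v_ge=1.0, v_res=2.0)
h2 = entry_mean_heritability(v_g=2.0, v_ge=1.0, v_res=2.0, n_env=4, n_rep=2)
```

Under this formula, adding environments or replicates shrinks the GEI and residual terms in the denominator and therefore raises the heritability of genotype means.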

Recently, the term 'envirotyping' has been introduced in the context of GEI, 
and its integration with the new genotyping and phenotyping tools will help 
breeding programs speed up the development of high-yielding 
cultivars (Cooper et al., 2014, Van Eeuwijk et al., 2019). Envirotyping refers to 
the process of characterizing environmental factors such as climate, soil, biotic 
factors, and crop management (Xu, 2016). Precise characterization of E factors provides 
the key component for refining field experiments, reducing environmental 
variance, and thus increasing heritability estimation. Environment-related data 
have been applied in predictive G2P models resulting in improvements in 
predictive ability (Pérez-Rodríguez et al., 2015, De Los Campos et al., 2020). 
For example, Jarquín et al. (2014) and Pérez-Rodríguez et al. (2015) used 
68 and 76 environmental covariates in wheat and cotton, respectively. Interest 
is also growing in 'enviromics' for the exploration of key genetic 
components linked to environmental factors or development stages (Crossa 
et al., 2021, Resende et al., 2021). Another active topic is the establishment of 
a four-dimensional (4D) profile in which G-P-E information is combined with 
time series (Xu, 2016), but this process still has to be integrated with the new 
phenotyping data in breeding pipelines. 

In recent years, HTPP tools have been used extensively in crop trials, generating 
large amounts of data with high spatial and temporal resolution (Mir et al., 2019, 
Rebetzke et al., 2019). Many approaches (discussed in Section 2) already have 
high potential to be integrated into the routine of breeding pipelines such as (i) 
estimating NDVI to characterize biomass accumulation and canopy greenness 
in the phenological stages (Rebetzke et al., 2016), (ii) identifying flowering time 
and the number of spikes in wheat using sequential images (Sadeghi-Tehran 
et al., 2017b, 2019), (iii) using canopy temperature to identify drought-tolerant 
genotypes, and (iv) quantification of the severity of infection (Yu et al., 2018). 
The primary goal in breeding programs is the improvement of yield (commonly 
referred to as the target or primary trait), but characterizing secondary phenotypes 
could be of great value when selecting for yield, as they are genetically correlated with it. 
Secondary traits can be characterized as intermediate (biomass, flowering 
time, grain number) or basic (sensitivity to photoperiod, water uptake capacity, 
radiation use efficiency) phenotypes. Similarly, hundreds of hyperspectral bands 
can be used as secondary traits for predicting yield (Fig. 1). However, it is not 
clear which secondary traits should be prioritized and how they can be integrated 
into G2P models (Van Eeuwijk et al., 2019). Furthermore, additional costs for 
phenotyping secondary traits may not necessarily translate into improved 
predictive ability for the primary trait (Bustos-Korts et al., 2019). 

Despite grain yield being the primary trait of interest, breeders might also 
be interested in using other phenotypes collected through phenotyping tools 
such as canopy height and FVC, to use in combination with secondary traits 
such as hyperspectral data (Fig. 1). In this case, indirect selection and selection 
index are two common methods used when seeking improvements of yield 
in a target population of environments. In indirect selection, a primary trait 
y is selected indirectly by selecting for another genetically correlated trait x 
(Fischer and Rebetzke, 2018) that can be quantified using HTPP, for example 
using spectral reflectance traits to select materials with high yield (Lozada et al., 
2020, Kyratzis et al., 2017). In the selection index (SI) context, a primary trait y is 
selected based on an index calculated from a set of t secondary traits. A linear 
index is built as follows: 


$$SI = \sum_{i=1}^{t} b_i t_i$$


where $b_i$ and $t_i$ are the weight and phenotypic value of the i-th trait, respectively. The 
weight represents the relative importance of each trait. Optimization of the SI when 
applying hyperspectral data, to reduce overfitting and the sub-optimal accuracy of 
indirect selection, has been proposed in the recent literature (Lopez-Cruz et al., 2020). 
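
A linear selection index of the kind described above, that is, a weighted sum of secondary trait values, can be sketched in a few lines of Python. The trait names, weights, and standardized trait values below are hypothetical, and the weights are simply chosen by hand here rather than derived from genetic and phenotypic covariances or from a regularized procedure.

```python
def selection_index(weights, trait_values):
    """Weighted sum of trait values: sum over traits of weight * value."""
    return sum(b * t for b, t in zip(weights, trait_values))

# Hypothetical standardized secondary traits per genotype: [NDVI, canopy temperature].
# The negative weight on canopy temperature assumes cooler canopies are favourable.
weights = [0.6, -0.4]
genotypes = {
    "G1": [1.2, -0.5],
    "G2": [0.3, 0.8],
    "G3": [-0.9, -1.1],
}

scores = {name: selection_index(weights, values) for name, values in genotypes.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Genotypes are then selected from the top of `ranking`; with hundreds of hyperspectral bands as secondary traits, the choice of weights becomes the central problem, which is where regularization is intended to help.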

Figure 1 Schematic representation of the integration of HTPP data into 
genotype-to-phenotype (G2P) predictive models in crop breeding. (a) The first approach is to 
integrate secondary traits into G2P models to improve the prediction of target or 
primary phenotypes, that is, grain yield (GY). Secondary phenotypes are represented by 
whole hyperspectral reflectance data (e.g. near-infrared reflectance spectroscopy) and 
vegetation (VI) and pigmented (PI) indices. Secondary traits could be incorporated (1) 
in a linear mixed model (LMM) using hyperspectral data as predictors (single kernel) in 
an approach called phenomic prediction or by using a multivariate linear mixed model 
(MvLMM), (2) using other frameworks such as functional and factorial regression, and (3) 
by using crop growth models. (b) The second approach is to explore G2P models for 
longitudinal traits to capture plant functioning at different stages of crop development. 
The most common predictive G2P models used for such an approach are traditional 
MvLMM (with specific variance-covariance) and random regression models. TP, time point; 
MegaLMM, mega-scale linear mixed model. 


The development of robust G2P frameworks that can accommodate these 
high-dimensional secondary traits together with genomic and environmental 
data is required and has been discussed in recent reviews (Van Eeuwijk et al., 2019, 
Morota et al., 2019). Primarily, this includes (i) extraction and characterization 
of HTPP secondary traits and environmental data, (ii) robust experimental 
design management (e.g. correction for spatial factors), (iii) dynamic modelling 
(e.g. time point (TP) and environment), and (iv) target trait prediction (using 
robust approaches such as mixed and crop growth models). For a more detailed 
description of this workflow, see the study by Van Eeuwijk et al. (2019). Briefly, 
in the first stage, images are pre-processed (alignment, calibration, and 
segmentation) to extract informative data. This step can be laborious and 
time-consuming, as large amounts of information are reduced to a small number 
of variables relevant for phenotypic and genomic prediction. In the second 
stage, adjustment of genotypic means, including correction for spatial 
variation, is required. Analyses can be performed per TP (such data are also 
referred to as time-series, longitudinal, or repeated measurements) using a 
mixed-model approach including a first-order autoregressive structure or a 2D 
P-spline basis in the row and column directions. In the third stage, genotypic 
means are treated as unique observations and modelled as a function of time 
(in case the target trait is longitudinal). Parametric (logistic, Gompertz, and 
exponential growth functions) and spline (P and B) models are used to capture 
the dynamics of traits, for example, canopy height, AGB, and leaf area. Lastly, 
after estimating genotype-specific parameters, the approach needs to be 
expanded to multiple environments; a function integrating both factors can 
then adjust for temporal and environmental gradients (Chenu et al., 2009, 
Cooper et al., 2021). 

In summary, G2P models can predict target traits from genotype-specific 
and environmental inputs. In the future, more measurements using field 
phenotyping tools in multiple environments will be available; therefore, G2P 
models must be fine-tuned to incorporate secondary phenotyping information 
in a breeding pipeline in an efficient way. 


3.2 Combining HTPP data into predictive G2P models 


The integration of HTPP and genomic data is now a reality in plant 
breeding programs (Araus et al., 2018). Many adjustments to standard G2P 
models have been made to account for the complexity of secondary traits in GS 
approaches to improve the selection efficiency of the target trait (Table 1). The 
most used method in whole-genome regression models is the genomic best 
linear unbiased prediction (GBLUP), which utilizes a genomic relationship matrix to 
estimate the genetic merit of an individual (Morota and Gianola, 2014). The matrix 
defines the covariance between individuals based on observed similarity at the 
genomic level, rather than on expected similarity based on pedigree. Approaches 
for integrating HTPP data into genomic selection schemes can be classified into two 
common statistical categories: univariate and multivariate. 


3.2.1 Univariate predictive G2P models 


The univariate model focuses on predicting one primary trait collected either 
using manual phenotyping (grain yield) or high-throughput platforms (e.g. 
canopy height, NDVI, senescence). Several studies used traits collected using 
UAV as target traits in GP studies, reducing the labour cost compared to traditional 
measurements (Watanabe et al., 2017). From a mixed-model perspective, GP can be 
performed with these traits using a single kernel or multiple kernels derived 
from molecular markers and/or traits derived from image data, as described in 
Section 2. The type of hyperspectral data (whole spectra vs. SRIs) and the feature 
selection (e.g. based on heritability) that should be used in G2P models have 
been discussed (van Eeuwijk et al., 2019, Tardieu et al., 2017). When only 
genetic variants (or different layers such as gene expression) are used to predict 
the genetic values, the approach is referred to as genomic prediction. Alternatively, 
when only spectral data or any other data collected through digital phenotyping 
are used, it is referred to as phenomic selection or phenomic prediction (Rincent 
et al., 2018). 

Phenomic selection was first introduced by Rincent et al. (2018) using only 
secondary traits such as near-infrared spectroscopy (NIRS) data, in the same way as 
genomic regressors (genomic kernel), allowing predictions of untested wheat 
and poplar (Populus nigra L.) genotypes. The authors observed higher predictive 
ability compared to a single genomic kernel. Lane et al. (2020) further explored 
this approach predicting grain yield in maize genotypes using NIR spectra as 
kernel and observed promising results. In practical terms, phenomic selection 
might be capturing the phenotypic relationship between individuals from a 
biological composition basis (Rincent et al., 2018). The reflectance spectrum 
is a result of numerous wavelengths capturing chemical bonds in the analysed 
tissue. Interpretation of phenomic prediction results must be done with caution 
as they do not directly model the genetic component of correlations between 
primary and secondary traits (Runcie et al., 2021). 

Suppose we have three scenarios of predictive G2P models from a 
mixed-model perspective (ignoring environment-specific information and TP 
variables) as follows: 


$$y = X\beta + Z_g g + \varepsilon \quad \text{(single-kernel genomic prediction)} \tag{1}$$
$$y = X\beta + Z_t t + \varepsilon \quad \text{(single-kernel phenomic prediction)} \tag{2}$$
$$y = X\beta + Z_g g + Z_t t + \varepsilon \quad \text{(multi-kernel prediction)} \tag{3}$$


where $y$ is a vector of phenotypic values, $\beta$ is a vector of fixed effects, $g$ is a 
vector of random additive genetic effects of markers, $t$ is a vector of random 
secondary trait (e.g. whole hyperspectral, NIR) effects, and $\varepsilon$ is a vector of 
random residuals. The incidence matrices for $\beta$, $g$, and $t$ are $X$, $Z_g$, and $Z_t$, 
respectively. The random effects are independent and normally distributed: 
$g \sim N(0, G_g \sigma_g^2)$, $t \sim N(0, G_t \sigma_t^2)$, and $\varepsilon \sim N(0, I \sigma_\varepsilon^2)$, where $I$ is the identity 
matrix. $G_g$ is a kernel or relationship matrix (covariance matrix describing 
genomic similarities between pairs of genotypes) computed as $G_g = WW'/m$, where 
$W$ is an $n \times m$ matrix of scaled and centred markers from $n$ individuals and $m$ 
is the total number of markers. $G_t$ is a kernel or covariance matrix describing 
similarities based on secondary trait information between pairs of genotypes, 
computed as $G_t = TT'/s$, where $T$ is an $n \times s$ matrix of scaled and centred secondary 
traits from $n$ individuals and $s$ is the total number of secondary traits. 
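
The two kernels used in these models are built in the same way: the columns of the input matrix (markers or secondary traits) are centred and scaled, and the cross-product is divided by the number of columns. The Python sketch below illustrates this construction; the function names, the marker coding (0, 1, 2), and the toy values are assumptions, and columns without variation are simply dropped here.

```python
import math

def standardize_columns(matrix):
    """Centre each column to mean 0 and scale it to variance 1.

    Columns with zero variance are dropped; at least one column must vary.
    """
    n = len(matrix)
    columns = []
    for col in zip(*matrix):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > 0:
            sd = math.sqrt(var)
            columns.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*columns)]

def relationship_kernel(matrix):
    """Kernel K = M M' / c, with M the standardized n x c input matrix."""
    m_std = standardize_columns(matrix)
    n_cols = len(m_std[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_cols for row_j in m_std]
        for row_i in m_std
    ]

# Hypothetical data for four genotypes.
markers = [        # allele counts at four markers
    [0, 1, 2, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [0, 2, 2, 1],
]
spectra = [        # reflectance at three wavebands
    [0.05, 0.45, 0.50],
    [0.06, 0.40, 0.48],
    [0.04, 0.50, 0.55],
    [0.07, 0.42, 0.47],
]

genomic_kernel = relationship_kernel(markers)    # n x n, from markers
phenomic_kernel = relationship_kernel(spectra)   # n x n, from secondary traits
```

Both kernels are n x n matrices indexed by genotype, which is what allows them to be used interchangeably or jointly as covariance structures in single- and multi-kernel models.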

Several studies have integrated HTPP data in predictive models using 
the multi-kernel approach mentioned above and expanded further on the 
integration of GEI and envirotyping (Montesinos-Lopez et al., 2017b, Aguate 
et al., 2017, Rutkoski et al., 2016, Krause et al., 2019, Gonçalves et al., 2021). 
For example, Aguate et al. (2017) incorporated wavelengths into a GP model 
in maize hybrids, showing that using wavelengths as predictors increased 
predictive ability compared to VIs. Montesinos-Lopez et al. (2017b) compared 
the full set and a subset of bands when predicting grain yield in wheat, concluding 
that using all wavelengths resulted in higher predictive ability. Recently, Krause 
et al. (2019) used a multi-kernel GBLUP approach combining marker, pedigree, 
and hyperspectral reflectance kernels and observed the highest predictive 
ability compared to single kernels when predicting genotypes. Cuevas et al. 
(2019) applied a non-linear arc-cosine kernel (deep learning artificial neural 
networks) to a wheat data set, demonstrating promising results. 


3.2.2 Multivariate predictive G2P models 


The multivariate approach focuses on predicting multiple traits as response 
variables using a single kernel from molecular markers. The phenotypes can 
therefore be a combination of the target trait (mainly yield) and correlated secondary 
traits (usually VIs). This approach is expected to improve predictive ability 
relative to single-trait models by enabling information to be shared among correlated 
traits, particularly for traits showing low heritability. Initial implementation of such 
analyses was published by Rutkoski et al. (2016) followed by Sun et al. (2017) 
and Crain et al. (2018). These three publications used canopy temperature 
and NDVI as secondary traits to improve the predictive ability of grain yield 
in wheat. Rutkoski et al. (2016) showed that the integration of both types of 
secondary traits into a single-kernel GBLUP model increased the predictive 
ability by 70% compared to the univariate model. Sun et al. (2017) showed 
that the incorporation of secondary traits in both training and testing sets in a 
multivariate model yielded the best results in terms of predictive ability. Hayes 
et al. (2017) observed improvements in predictive ability when incorporating 
NIR and nuclear magnetic resonance spectra as secondary traits, in a predictive 
multi-trait framework in wheat. Crain et al. (2018) included grain yield, canopy 
temperature, and VI measurements in a multi-trait model showing a slight 
increase (7% gains) in predictive ability when compared to the single-trait model. 

Direct implementation of thousands of hyperspectral bands in a traditional 
G2P multivariate model would be computationally demanding. Several authors 
have used feature selection to identify those traits that are relevant (Crain et al., 
2018, Rutkoski et al., 2016, Sun et al., 2017). Other authors have replaced the 
multivariate method with a direct regression on the secondary phenotype 
data, for example using secondary traits (wavelengths or environmental input) 
as functional covariates in functional regression (Montesinos-Lopez et al., 
2017a, b) and factorial regression (van Eeuwijk et al., 2018) models. A review of 
functional regression analysis using HTPP data has been written by 
Montesinos-Lopez et al. (2018). Alternative methods have also been proposed to 
handle such high-complexity data efficiently, such as 'item-based collaborative 
filtering' (IBCF) (Juliana et al., 2019) and regularized selection indices 
(Lopez-Cruz et al., 2020). For example, the IBCF approach can be effectively applied 
in scenarios where trait correlations with the primary target (grain yield) are 
moderate to high (Juliana et al., 2019). 

While these methods mentioned above (e.g. functional regression) are 
powerful alternatives to the traditional multi-trait model, they do not explicitly 
account for the high-dimensional genetic correlations between traits. To 
overcome this issue, Runcie et al. (2021) proposed a multivariate linear mixed 
model (MegaLMM) taking advantage of the factor-analytic approach. They used 
the original data set from Krause et al. (2019) and observed higher predictive 
ability when using MegaLMM compared to the multi-kernel (genomic + 
phenomic) approach. According to the authors, by directly modelling the 
genetic covariance between the hyperspectral reflectance traits, MegaLMM 
should be more efficient, mainly when secondary traits and the target (focal) 
traits are measured on the same plants. 


3.2.3 Predictive G2P models using longitudinal (time-series) traits 


Measuring plant development has become feasible with the advancements 
of HTPP in recent years. A 4D profile can be built based on the genotype, 
phenotype, and envirotype (3D profile), while the fourth dimension involves the 
developmental stages (temporal data) (Xu, 2016). For example, collecting 
time-series data enables the comparison between the canopy height of different 
accessions at the same growth stage in sorghum (Watanabe et al., 2017) and also 
allows the selection of high-yielding cultivars at an early plant developmental 
stage (Sun et al., 2017). Analysing G2P models from different TPs collected 
in the same environment is straightforward, but it is more difficult when 
measurements are made across environments. Also, Lyra et al. (2020) pointed 
out the importance of sampling data in later stages of growth development to 
capture more genetic variance. Therefore, many G2P frameworks have been 
adapted to incorporate information from new phenotyping and envirotyping 
data across developmental stages. The use of longitudinal data from 
HTPP in predictive G2P models has been reviewed by Moreira et al. (2020). 

For longitudinal data, a simple univariate mixed model can be fitted at each 
TP for traits like FVC, canopy height, senescence, and biomass accumulation. 
However, to capture the covariance between TPs, a multivariate approach can 
be used allowing each TP to have a unique variance and covariance between 
TPs. Other common methods used to model longitudinal data more flexibly 
are random regression and factorial regression (Baba et al., 2020, Campbell 
et al., 2018, Momen et al., 2019). The random regression model (RRM) (Schaeffer, 
2004, 2016) is used to estimate covariance functions and to 
model the shape/trajectory of observations taken over time. This trajectory 
(covariance across TPs) can be modelled using Legendre polynomials and 
splines. Sun et al. (2017) applied an RRM with a cubic smoothing spline in 
wheat, aiming to capture trait development (canopy temperature and 
NDVI) during the growth stages across multiple environments. Campbell 
et al. (2018) observed improvements in predictive ability with an RRM using 
Legendre polynomials to model shoot growth trajectories in rice, 
compared to single-time-point prediction. Lyra et al. (2020) also observed higher 
predictive ability compared to single-TP analysis when using B-spline basis 
coefficients as phenotypes, after adjusting the values using a factor analytic 
model, for canopy height measured at 26 TPs in wheat. 
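
To show what modelling a trajectory with Legendre polynomials involves, the sketch below builds the polynomial basis that forms the time covariates of a random regression model. It covers only the basis construction, not the mixed-model fit; the function names, the measurement days, and the coefficients are hypothetical, and unnormalized polynomials are used here, whereas some implementations rescale them.

```python
def standardize_time(days):
    """Map time points linearly onto [-1, 1], the domain of Legendre polynomials."""
    lo, hi = min(days), max(days)
    return [2.0 * (d - lo) / (hi - lo) - 1.0 for d in days]

def legendre_basis(x, order):
    """Values of the Legendre polynomials P_0 ... P_order at x.

    Uses Bonnet's recursion: (n + 1) P_{n+1} = (2n + 1) x P_n - n P_{n-1}.
    """
    values = [1.0, x]
    for n in range(1, order):
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: order + 1]

# Hypothetical measurement days for canopy height.
days = [20, 40, 60, 80, 100]
x = standardize_time(days)

# One row per time point, one column per polynomial (order 2: P_0, P_1, P_2).
design = [legendre_basis(xi, order=2) for xi in x]

# A genotype-specific trajectory is a weighted sum of the basis columns;
# these coefficients are made up for illustration.
coefficients = [50.0, 30.0, -5.0]
trajectory = [sum(c * b for c, b in zip(coefficients, row)) for row in design]
```

In an RRM the coefficients are random effects estimated per genotype, so that a few coefficients summarize the whole growth curve and their covariance defines the covariance between any two time points.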


4 Conclusion 


In this chapter, the current methods accessible for monitoring crop canopy 
development and nutrition, for predicting grain yield and yield components, as 
well as the various approaches available for integrating data from phenomics 
and envirotyping in predictive breeding, have been reviewed. Whilst much 
has been achieved in the last decade, digital phenotyping as a tool to support 
breeding programs is still in its infancy. One limitation has been the integration 
of digital phenotyping into the breeding process. The temporal and spatial 
resolution required to look at a trait of interest is likely to differ between the 
earlier and later stages of selection programs, requiring either tools to scan hundreds 
to thousands of genetic materials or fine-tuned tools for detailed analysis of 
the performance of the targeted traits on a sub-sample of a population. Such 
considerations define the cost-benefit of the technology deployed. Further 
limitations are the ability to deploy such technologies at different locations and 
the coordination of the data acquisition in multi-environment studies. Finally, 
the level of data integration to develop efficient pipelines to integrate digital 
phenotyping data into G2P models still requires further development. None 
of these points should be insurmountable with appropriate investment. The 
last decade has seen an exponential increase in terms of initiatives and studies 
around digital phenotyping applications to breeding, which should lead to 
significant breakthroughs in the next decade. 


6 Where to look for further information 


The following articles provide a good overview of the subject: 


* Crossa, J., Perez-Rodriguez, P., Cuevas, J., Montesinos-Lopez, O., Jarquin, 
D., de los Campos, G., Burgueno, J., Gonzalez-Camacho, J. M., 
Perez-Elizalde, S., Beyene, Y., Dreisigacker, S., Singh, R., Zhang, X. C., Gowda, 
M., Roorkiwal, M., Rutkoski, J. and Varshney, R. K. 2017. Genomic selection 
in plant breeding: methods, models, and perspectives. Trends in Plant 
Science 22, 961-975. 

* Pask, A. J. D., Pietragalla, J., Mullan, D. M. and Reynolds, M. P. (Eds.) 2012. 
Physiological Breeding II: A Field Guide to Wheat Phenotyping. Mexico, 
D.F.: CIMMYT. 

* van Eeuwijk, F. A., Bustos-Korts, D., Millet, E. J., Boer, M. P., Kruijer, W., 
Thompson, A., Malosetti, M., Iwata, H., Quiroz, R., Kuppe, C., Muller, O., Blazakis, 
K. N., Yu, K., Tardieu, F. and Chapman, S. C. 2018. Modelling strategies for 
assessing and increasing the effectiveness of new phenotyping techniques 
in plant breeding. Plant Science. doi: 10.1016/j.plantsci.2018.06.018. 

* Reynolds, M. and Langridge, P. 2016. Physiological breeding. Current 
Opinion in Plant Biology 31, 162-171. 


Further information on phenotyping infrastructure and initiatives: 


* IPPN (International Plant Phenotyping Network) is an association 
representing the major plant phenotyping centers and providing relevant 
information about plant phenotyping: www.plant-phenotyping.org/IPPN_home. 

* EMPHASIS is a project that aims to address the limitations of current 
phenotyping in order to fully exploit the genetic and genomic resources available 
for crop improvement in a changing climate: 
https://emphasis.plant-phenotyping.eu. 


Database for spectral reflectance indices: 


* The L3 Harris Geospatial documentation center hosts a large database of 
spectral indices: www.l3harrisgeospatial.com/docs/vegetationindices.html. 


7 Abbreviations 


AGB Above-ground biomass 

ANN Artificial neural network 

AUC Area under the curve 

Cab Leaf chlorophyll a and b content 
CCCI Canopy chlorophyll content index 

Cm Specific leaf mass 

CNC Canopy nitrogen content 

CNI Canopy nutrition index 

CT Canopy temperature 

Cw Leaf water content 

F' Steady-state fluorescence 

Fq'/Fm' Photosystem II operating efficiency 
FVC Fraction vegetation cover 

G Genotype 

G2P Genotype to phenotype 

GBLUP Genomic best linear unbiased prediction 
GEBV Genomic estimated breeding value 
GEI Genotype by environment interaction 
GP Genomic prediction 

GP Gaussian process 

GS Genomic selection 

HTPP High-throughput plant phenotyping 
IBCF Item-based collaborative filtering 

LAI Leaf area index 

LUE Light use efficiency 

MegaLMM Mega-scale linear mixed model 

MET Multi-environment trials 

N Nitrogen 

NDVI Normalized difference vegetation index 
NIRS Near-infrared spectroscopy 

NPQ Non-photochemical quenching 

P Phenotype 

PLSR Partial least squares regression 

PRI Photochemical/physiological reflectance index 
PSII Photosystem II 

RF Random Forest 

RGB Red-green-blue 

RRM Random regression model 

SfM Structure from motion 

SI Selection index 
SIF Sun-induced fluorescence 
SRI Spectral reflectance index 
SVM Support vector machine 
Tc-Ta Canopy minus air temperature 
TP Time point 

UAV Unmanned aerial vehicle 
VE Environmental variance 
Vε Random residual variance 
VGE GEI variance 

VI Vegetation index 

VP Phenotypic variance 

WDI Water deficit index 